Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Shangchen Zhou

PnP-U3D: Plug-and-Play 3D Framework Bridging Autoregression and Diffusion for Unified Understanding and Generation

Feb 03, 2026

Yongwei Chen, Tianyi Wei, Yushi Lan, Zhaoyang Lyu, Shangchen Zhou, Xudong Xu, Xingang Pan

Abstract:The rapid progress of large multimodal models has inspired efforts toward unified frameworks that couple understanding and generation. While such paradigms have shown remarkable success in 2D, extending them to 3D remains largely underexplored. Existing attempts to unify 3D tasks under a single autoregressive (AR) paradigm lead to significant performance degradation due to forced signal quantization and prohibitive training cost. Our key insight is that the essential challenge lies not in enforcing a unified autoregressive paradigm, but in enabling effective information interaction between generation and understanding while minimally compromising their inherent capabilities and leveraging pretrained models to reduce training cost. Guided by this perspective, we present the first unified framework for 3D understanding and generation that combines autoregression with diffusion. Specifically, we adopt an autoregressive next-token prediction paradigm for 3D understanding, and a continuous diffusion paradigm for 3D generation. A lightweight transformer bridges the feature space of large language models and the conditional space of 3D diffusion models, enabling effective cross-modal information exchange while preserving the priors learned by standalone models. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across diverse 3D understanding and generation benchmarks, while also excelling in 3D editing tasks. These results highlight the potential of unified AR+diffusion models as a promising direction for building more general-purpose 3D intelligence.

* Yongwei Chen and Tianyi Wei contributed equally. Project page: https://cyw-3d.github.io/PnP-U3D/

Via

Access Paper or Ask Questions

Zoom-IQA: Image Quality Assessment with Reliable Region-Aware Reasoning

Jan 06, 2026

Guoqiang Liang, Jianyi Wang, Zhonghua Wu, Shangchen Zhou

Abstract:Image Quality Assessment (IQA) is a long-standing problem in computer vision. Previous methods typically focus on predicting numerical scores without explanation or provide low-level descriptions lacking precise scores. Recent reasoning-based vision language models (VLMs) have shown strong potential for IQA, enabling joint generation of quality descriptions and scores. However, we notice that existing VLM-based IQA methods tend to exhibit unreliable reasoning due to their limited capability of integrating visual and textual cues. In this work, we introduce Zoom-IQA, a VLM-based IQA model to explicitly emulate key cognitive behaviors: uncertainty awareness, region reasoning, and iterative refinement. Specifically, we present a two-stage training pipeline: 1) supervised fine-tuning (SFT) on our Grounded-Rationale-IQA (GR-IQA) dataset to teach the model to ground its assessments in key regions; and 2) reinforcement learning (RL) for dynamic policy exploration, primarily stabilized by our KL-Coverage regularizer to prevent reasoning and scoring diversity collapse, and supported by a Progressive Re-sampling Strategy to mitigate annotation bias. Extensive experiments show that Zoom-IQA achieves improved robustness, explainability, and generalization. The application to downstream tasks, such as image restoration, further demonstrates the effectiveness of Zoom-IQA.

* Project Page: https://ethanliang99.github.io/ZOOMIQA-Projectpage

Via

Access Paper or Ask Questions

MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Dec 12, 2025

Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao

Figure 1 for MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Figure 2 for MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Figure 3 for MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Figure 4 for MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Abstract:Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

* Project page: https://pq-yang.github.io/projects/MatAnyone2/

Via

Access Paper or Ask Questions

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Aug 14, 2025

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, Xingang Pan

Abstract:We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces an streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments. More details can be found in our project page: https://nirvanalan.github.io/projects/stream3r.

* TL;DR: Streaming 4D reconstruction using causal transformer. Project page: https://nirvanalan.github.io/projects/stream3r

Via

Access Paper or Ask Questions

SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jun 05, 2025

Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang(+3 more)

Figure 1 for SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Figure 2 for SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Figure 3 for SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Figure 4 for SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Abstract:Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet yield a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed as SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

* Draft Ver. Project page: https://iceclear.github.io/projects/seedvr2/

Via

Access Paper or Ask Questions

ObjectClear: Complete Object Removal via Object-Effect Attention

May 28, 2025

Jixin Zhao, Shangchen Zhou, Zhouxia Wang, Peiqing Yang, Chen Change Loy

Abstract:Object removal requires eliminating not only the target object but also its effects, such as shadows and reflections. However, diffusion-based inpainting methods often produce artifacts, hallucinate content, alter background, and struggle to remove object effects accurately. To address this challenge, we introduce a new dataset for OBject-Effect Removal, named OBER, which provides paired images with and without object effects, along with precise masks for both objects and their associated visual artifacts. The dataset comprises high-quality captured and simulated data, covering diverse object categories and complex multi-object scenes. Building on OBER, we propose a novel framework, ObjectClear, which incorporates an object-effect attention mechanism to guide the model toward the foreground removal regions by learning attention masks, effectively decoupling foreground removal from background reconstruction. Furthermore, the predicted attention map enables an attention-guided fusion strategy during inference, greatly preserving background details. Extensive experiments demonstrate that ObjectClear outperforms existing methods, achieving improved object-effect removal quality and background fidelity, especially in complex scenarios.

* Project page: https://zjx0101.github.io/projects/ObjectClear/

Via

Access Paper or Ask Questions

Bokeh Diffusion: Defocus Blur Control in Text-to-Image Diffusion Models

Mar 13, 2025

Armando Fortes, Tianyi Wei, Shangchen Zhou, Xingang Pan

Abstract:Recent advances in large-scale text-to-image models have revolutionized creative fields by generating visually captivating outputs from textual prompts; however, while traditional photography offers precise control over camera settings to shape visual aesthetics -- such as depth-of-field -- current diffusion models typically rely on prompt engineering to mimic such effects. This approach often results in crude approximations and inadvertently altering the scene content. In this work, we propose Bokeh Diffusion, a scene-consistent bokeh control framework that explicitly conditions a diffusion model on a physical defocus blur parameter. By grounding depth-of-field adjustments, our method preserves the underlying scene structure as the level of blur is varied. To overcome the scarcity of paired real-world images captured under different camera settings, we introduce a hybrid training pipeline that aligns in-the-wild images with synthetic blur augmentations. Extensive experiments demonstrate that our approach not only achieves flexible, lens-like blur control but also supports applications such as real image editing via inversion.

* Project page: https://atfortes.github.io/projects/bokeh-diffusion/

Via

Access Paper or Ask Questions

MatAnyone: Stable Video Matting with Consistent Memory Propagation

Jan 24, 2025

Peiqing Yang, Shangchen Zhou, Jixin Zhao, Qingyi Tao, Chen Change Loy

Figure 1 for MatAnyone: Stable Video Matting with Consistent Memory Propagation

Figure 2 for MatAnyone: Stable Video Matting with Consistent Memory Propagation

Figure 3 for MatAnyone: Stable Video Matting with Consistent Memory Propagation

Figure 4 for MatAnyone: Stable Video Matting with Consistent Memory Propagation

Abstract:Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.

* Project page: https://pq-yang.github.io/projects/MatAnyone

Via

Access Paper or Ask Questions

3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement

Dec 24, 2024

Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

Abstract:Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and a multi-view attention module with epipolar aggregation to maintain consistent, high-quality 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles. Extensive evaluations show that 3DEnhancer significantly outperforms existing methods, boosting both multi-view enhancement and per-instance 3D optimization tasks.

* Project page: https://yihangluo.com/projects/3DEnhancer

Via

Access Paper or Ask Questions

ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Dec 10, 2024

Zhouxia Wang, Yushi Lan, Shangchen Zhou, Chen Change Loy

Figure 1 for ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Figure 2 for ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Figure 3 for ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Figure 4 for ObjCtrl-2.5D: Training-free Object Control with Camera Poses

Abstract:This study aims to achieve more precise and versatile object control in image-to-video (I2V) generation. Current methods typically represent the spatial movement of target objects with 2D trajectories, which often fail to capture user intention and frequently produce unnatural results. To enhance control, we present ObjCtrl-2.5D, a training-free object control approach that uses a 3D trajectory, extended from a 2D trajectory with depth information, as a control signal. By modeling object movement as camera movement, ObjCtrl-2.5D represents the 3D trajectory as a sequence of camera poses, enabling object motion control using an existing camera motion control I2V generation model (CMC-I2V) without training. To adapt the CMC-I2V model originally designed for global motion control to handle local object motion, we introduce a module to isolate the target object from the background, enabling independent local control. In addition, we devise an effective way to achieve more accurate object control by sharing low-frequency warped latent within the object's region across frames. Extensive experiments demonstrate that ObjCtrl-2.5D significantly improves object control accuracy compared to training-free methods and offers more diverse control capabilities than training-based approaches using 2D trajectories, enabling complex effects like object rotation. Code and results are available at https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/.

* Project Page: https://wzhouxiff.github.io/projects/ObjCtrl-2.5D/

Via

Access Paper or Ask Questions